Abstract

We derive exponential tail inequalities for sums of random matrices with no dependence on 
the explicit matrix dimensions. These are similar to the matrix versions of the Chernoff bound 
and Bernstein inequality except with the explicit matrix dimensions replaced by a trace quantity 
that can be small even when the dimension is large or infinite. Some applications to principal 
component analysis and approximate matrix multiplication are given to illustrate the utility of 
the new bounds. 
1 Introduction

Sums of random matrices arise in many statistical and probabilistic applications, and hence their
concentration behavior is of significant interest. Surprisingly, the classical exponential moment
method used to derive tail inequalities for scalar random variables carries over to the matrix
setting when augmented with certain matrix trace inequalities. This fact was first discovered
by Ahlswede and Winter (2002), who proved a matrix version of the Chernoff bound using the
Golden-Thompson inequality (Golden, 1965; Thompson, 1965): $\operatorname{tr}\exp(A + B) \le \operatorname{tr}(\exp(A)\exp(B))$
for all symmetric matrices $A$ and $B$. Later, it was demonstrated that the same technique could be
adapted to yield analogues of other tail bounds (Recht, 2009; Gross, 2009; Oliveira, 2010a). Altogether,
these results have proved invaluable in constructing and simplifying many probabilistic arguments
concerning sums of random matrices.

One deficiency of these previous inequalities is their explicit dependence on the dimension, which
prevents their application to infinite dimensional spaces that arise in a variety of data analysis
tasks (e.g., Schölkopf et al., 1999; Rasmussen and Williams, 2006; Fukumizu et al., 2007; Bach,
2008). In this work, we prove analogous results where the dimension is replaced with a trace
quantity that can be small even when the dimension is large or infinite. For instance, in our
matrix generalization of Bernstein's inequality, the (normalized) trace of the second moment matrix
appears instead of the matrix dimension. Such trace quantities can often be regarded as an intrinsic
notion of dimension. The price for this improvement is that the more typical exponential tail $e^{-t}$
is replaced with a slightly weaker tail $t(e^t - t - 1)^{-1}$. As $t$ becomes large, the difference
becomes negligible. For instance, if $t \ge 2.6$, then $t(e^t - t - 1)^{-1} \le e^{-t/2}$.
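The comparison between the two tails can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `weak_tail` and `scalar_tail` and the grid of test points are illustrative choices, not part of the original text.

```python
import math

def weak_tail(t: float) -> float:
    """The weaker tail factor t / (e^t - t - 1), defined for t > 0."""
    # math.expm1(t) computes e^t - 1, so expm1(t) - t equals e^t - t - 1.
    return t / (math.expm1(t) - t)

def scalar_tail(t: float) -> float:
    """The reference tail e^{-t/2}."""
    return math.exp(-t / 2.0)

# Grid of t values from 2.6 to 60.0 in steps of 0.1 (an arbitrary test range).
grid = [2.6 + 0.1 * i for i in range(575)]
holds_on_grid = all(weak_tail(t) <= scalar_tail(t) for t in grid)

# Slightly below the threshold the comparison already fails.
fails_at_2_5 = weak_tail(2.5) > scalar_tail(2.5)
```

On this grid the inequality holds from $t = 2.6$ onward and fails at $t = 2.5$, which suggests that the constant 2.6 is close to the smallest threshold for which the comparison is valid.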

There are some previous works that give dimension-free tail inequalities in some special cases.
Rudelson and Vershynin (2007) prove exponential tail inequalities for sums of rank-one matrices by
way of a key inequality of Rudelson (1999) (see also Oliveira, 2010a). Magen and Zouzias (2011)
prove tail inequalities for sums of low-rank matrices using non-commutative Khintchine moment
inequalities, but fall short of giving an exponential tail inequality. In contrast, our results are
proved using a natural matrix generalization of the exponential moment method.



2 Preliminaries 



Let $\xi_1, \dots, \xi_n$ be random variables, and for each $i = 1, \dots, n$, let $X_i := X_i(\xi_1, \dots, \xi_i)$ be a symmetric
matrix-valued functional of $\xi_1, \dots, \xi_i$. We use $\mathbb{E}_i[\,\cdot\,]$ as shorthand for $\mathbb{E}[\,\cdot \mid \xi_1, \dots, \xi_{i-1}]$. For any
symmetric matrix $H$, let $\lambda_{\max}(H)$ denote its largest eigenvalue, $\exp(H) := I + \sum_{k=1}^{\infty} H^k/k!$, and
$\log(\exp(H)) := H$.

The following convex trace inequality of Lieb (1973) was also used by Tropp (2011a,b).

Theorem 1 (Lieb, 1973). For any symmetric matrix $H$, the function $M \mapsto \operatorname{tr}\exp(H + \log(M))$ is
concave in $M$ for $M \succ 0$.

The following lemma due to Tropp (2011b) is a matrix generalization of a scalar result due
to Freedman (1975) (see also Zhang, 2005), where the key is the invocation of Theorem 1. We give
the proof for completeness.



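Lemma 1 (Tropp, 2011b). For any constant symmetric matrix $X_0$,

\[
\mathbb{E}\left[\operatorname{tr}\exp\left(\sum_{i=0}^{n} X_i - \sum_{i=1}^{n} \log\mathbb{E}_i[\exp(X_i)]\right)\right] \le \operatorname{tr}\exp(X_0). \tag{1}
\]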



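Proof. By induction on $n$. The claim holds trivially for $n = 0$. Now fix $n \ge 1$, and assume as the
inductive hypothesis that (1) holds with $n$ replaced by $n - 1$. In this case,

\begin{align*}
\mathbb{E}\left[\operatorname{tr}\exp\left(\sum_{i=0}^{n} X_i - \sum_{i=1}^{n} \log\mathbb{E}_i[\exp(X_i)]\right)\right]
&= \mathbb{E}\left[\mathbb{E}_n\left[\operatorname{tr}\exp\left(\sum_{i=0}^{n-1} X_i - \sum_{i=1}^{n} \log\mathbb{E}_i[\exp(X_i)] + \log\exp(X_n)\right)\right]\right] \\
&\le \mathbb{E}\left[\operatorname{tr}\exp\left(\sum_{i=0}^{n-1} X_i - \sum_{i=1}^{n} \log\mathbb{E}_i[\exp(X_i)] + \log\mathbb{E}_n[\exp(X_n)]\right)\right] \\
&= \mathbb{E}\left[\operatorname{tr}\exp\left(\sum_{i=0}^{n-1} X_i - \sum_{i=1}^{n-1} \log\mathbb{E}_i[\exp(X_i)]\right)\right] \\
&\le \operatorname{tr}\exp(X_0)
\end{align*}

where the first inequality follows from Theorem 1 and Jensen's inequality, and the second inequality
follows from the inductive hypothesis. □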
3 Exponential tail inequalities for sums of random matrices
3.1 A generic inequality

We first state a generic inequality based on Lemma 1. This differs from earlier approaches, which
instead combine Markov's inequality with a result similar to Lemma 1 (e.g., Tropp, 2011a, Theorem
3.6).



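Theorem 2. For any $\eta \in \mathbb{R}$ and any $t > 0$,

\[
\Pr\left[\lambda_{\max}\left(\eta\sum_{i=1}^{n} X_i - \sum_{i=1}^{n} \log\mathbb{E}_i[\exp(\eta X_i)]\right) > t\right]
\le \operatorname{tr}\left(\mathbb{E}\left[-\eta\sum_{i=1}^{n} X_i + \sum_{i=1}^{n} \log\mathbb{E}_i[\exp(\eta X_i)]\right]\right) \cdot (e^t - t - 1)^{-1}.
\]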



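Proof. Fix a constant matrix $X_0$, and let $A := \eta\sum_{i=0}^{n} X_i - \sum_{i=1}^{n} \log\mathbb{E}_i[\exp(\eta X_i)]$. Note that
$g(x) := e^x - x - 1$ is non-negative for all $x \in \mathbb{R}$ and increasing for $x \ge 0$. Letting $\{\lambda_i(A)\}$ denote
the eigenvalues of $A$, we have

\begin{align*}
\Pr[\lambda_{\max}(A) > t]\,(e^t - t - 1) &= \mathbb{E}\left[\mathbb{1}[\lambda_{\max}(A) > t]\,(e^t - t - 1)\right] \\
&\le \mathbb{E}\left[e^{\lambda_{\max}(A)} - \lambda_{\max}(A) - 1\right] \\
&\le \mathbb{E}\left[\sum_i \left(e^{\lambda_i(A)} - \lambda_i(A) - 1\right)\right] \\
&= \mathbb{E}\left[\operatorname{tr}(\exp(A) - A - I)\right] \\
&\le \operatorname{tr}(\exp(X_0) + \mathbb{E}[-A] - I)
\end{align*}

where the last inequality follows from Lemma 1. Now we take $X_0 \to 0$ so $\operatorname{tr}(\exp(X_0) - I) \to 0$. □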
3.2 Some specific bounds 

We now give some specific bounds as corollaries of Theorem 2. Most of the estimates used in the
proofs are taken from previous works (e.g., Ahlswede and Winter, 2002; Tropp, 2011a); the main
point here is to show how these previous techniques can be combined with Theorem 2 to yield new
tail inequalities with no explicit dependence on the matrix dimension.

First, we give a bound under a subgaussian-type condition on the distribution. 

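Theorem 3 (Matrix subgaussian bound). If there exist $\bar\sigma > 0$ and $\bar k > 0$ such that for all
$i = 1, \dots, n$,

\begin{align*}
\mathbb{E}_i[X_i] &= 0 \\
\lambda_{\max}\left(\frac{1}{n}\sum_{i=1}^{n} \log\mathbb{E}_i[\exp(\eta X_i)]\right) &\le \frac{\eta^2\bar\sigma^2}{2} \\
\mathbb{E}\left[\operatorname{tr}\left(\frac{1}{n}\sum_{i=1}^{n} \log\mathbb{E}_i[\exp(\eta X_i)]\right)\right] &\le \frac{\eta^2\bar\sigma^2\bar k}{2}
\end{align*}

for all $\eta > 0$ almost surely, then for any $t > 0$,

\[
\Pr\left[\lambda_{\max}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) > \sqrt{\frac{2\bar\sigma^2 t}{n}}\right] \le \bar k \cdot t(e^t - t - 1)^{-1}.
\]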
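Proof. We fix $\eta := \sqrt{2t/(\bar\sigma^2 n)}$. By Theorem 2, we obtain

\begin{align*}
\Pr\left[\lambda_{\max}\left(\frac{1}{n}\sum_{i=1}^{n} X_i - \frac{1}{n\eta}\sum_{i=1}^{n} \log\mathbb{E}_i[\exp(\eta X_i)]\right) > \frac{t}{n\eta}\right]
&\le \operatorname{tr}\left(\mathbb{E}\left[-\eta\sum_{i=1}^{n} X_i + \sum_{i=1}^{n} \log\mathbb{E}_i[\exp(\eta X_i)]\right]\right) \cdot (e^t - t - 1)^{-1} \\
&\le \frac{n\eta^2\bar\sigma^2\bar k}{2} \cdot (e^t - t - 1)^{-1} \\
&= \bar k \cdot t(e^t - t - 1)^{-1},
\end{align*}

where the second inequality uses $\mathbb{E}_i[X_i] = 0$ and the trace condition. Now suppose

\[
\lambda_{\max}\left(\frac{1}{n}\sum_{i=1}^{n} X_i - \frac{1}{n\eta}\sum_{i=1}^{n} \log\mathbb{E}_i[\exp(\eta X_i)]\right) \le \frac{t}{n\eta}.
\]

This implies for every non-zero vector $u$,

\[
\frac{u^\top\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)u}{u^\top u}
\le \frac{u^\top\left(\frac{1}{n\eta}\sum_{i=1}^{n} \log\mathbb{E}_i[\exp(\eta X_i)]\right)u}{u^\top u} + \frac{t}{n\eta}
\le \lambda_{\max}\left(\frac{1}{n\eta}\sum_{i=1}^{n} \log\mathbb{E}_i[\exp(\eta X_i)]\right) + \frac{t}{n\eta}
\]

and therefore

\[
\lambda_{\max}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
\le \lambda_{\max}\left(\frac{1}{n\eta}\sum_{i=1}^{n} \log\mathbb{E}_i[\exp(\eta X_i)]\right) + \frac{t}{n\eta}
\le \frac{\eta\bar\sigma^2}{2} + \frac{t}{n\eta}
= \sqrt{\frac{2\bar\sigma^2 t}{n}}
\]

as required. □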
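The choice of $\eta$ in the proof of Theorem 3 balances the two terms $\eta\bar\sigma^2/2$ and $t/(n\eta)$. The sketch below checks this arithmetic for one arbitrary set of values; the function names and the test values are illustrative assumptions, not part of the paper.

```python
import math

def subgaussian_eta(sigma: float, t: float, n: int) -> float:
    """The choice eta = sqrt(2 t / (sigma^2 n)) made in the proof."""
    return math.sqrt(2.0 * t / (sigma ** 2 * n))

def deviation_at_eta(sigma: float, t: float, n: int, eta: float) -> float:
    """The deviation bound eta * sigma^2 / 2 + t / (n * eta) for a given eta."""
    return eta * sigma ** 2 / 2.0 + t / (n * eta)

def closed_form(sigma: float, t: float, n: int) -> float:
    """The stated deviation sqrt(2 sigma^2 t / n)."""
    return math.sqrt(2.0 * sigma ** 2 * t / n)

# Arbitrary illustrative values.
sigma, t, n = 1.5, 3.0, 200
eta = subgaussian_eta(sigma, t, n)
balanced = deviation_at_eta(sigma, t, n, eta)
# With this eta, n * eta^2 * sigma^2 / 2 collapses to t, which gives the k * t prefactor.
prefactor_over_k = n * eta ** 2 * sigma ** 2 / 2.0
```

Perturbing `eta` in either direction increases `deviation_at_eta`, consistent with this `eta` being the minimizer of the deviation term.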



We can also give a Bernstein-type bound based on moment conditions. For simplicity, we just
state the bound in the case that the $\lambda_{\max}(X_i)$ are bounded almost surely.

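Theorem 4 (Matrix Bernstein bound). If there exist $\bar b > 0$, $\bar\sigma > 0$, and $\bar k > 0$ such that for all
$i = 1, \dots, n$,

\begin{align*}
\mathbb{E}_i[X_i] &= 0 \\
\lambda_{\max}(X_i) &\le \bar b \\
\lambda_{\max}\left(\frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_i[X_i^2]\right) &\le \bar\sigma^2 \\
\mathbb{E}\left[\operatorname{tr}\left(\frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_i[X_i^2]\right)\right] &\le \bar\sigma^2\bar k
\end{align*}

almost surely, then for any $t > 0$,

\[
\Pr\left[\lambda_{\max}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) > \sqrt{\frac{2\bar\sigma^2 t}{n}} + \frac{\bar b t}{3n}\right] \le \bar k \cdot t(e^t - t - 1)^{-1}.
\]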



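Proof. Let $\eta > 0$. For each $i = 1, \dots, n$,

\[
\exp(\eta X_i) \preceq I + \eta X_i + \frac{e^{\eta\bar b} - \eta\bar b - 1}{\bar b^2} \cdot X_i^2
\]

and therefore

\[
\log\mathbb{E}_i[\exp(\eta X_i)] \preceq \frac{e^{\eta\bar b} - \eta\bar b - 1}{\bar b^2} \cdot \mathbb{E}_i[X_i^2].
\]

Since $e^x - x - 1 \le x^2/(2(1 - x/3))$ for $0 \le x < 3$, we have by Theorem 2

\[
\Pr\left[\lambda_{\max}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) > \frac{\eta\bar\sigma^2}{2(1 - \eta\bar b/3)} + \frac{t}{\eta n}\right]
\le \frac{\eta^2\bar\sigma^2\bar k n}{2(1 - \eta\bar b/3)} \cdot (e^t - t - 1)^{-1}
\]

provided that $\eta < 3/\bar b$. Choosing

\[
\eta := \frac{3}{\bar b} \cdot \left(1 - \frac{\sqrt{2\bar\sigma^2 t/n}}{2\bar b t/(3n) + \sqrt{2\bar\sigma^2 t/n}}\right)
\]

gives the desired bound. □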
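The final step of the proof of Theorem 4 is left to the reader; the arithmetic behind it can be checked numerically. The sketch below evaluates the deviation term and the probability prefactor at the chosen $\eta$ for one arbitrary set of values. The function names and test values are illustrative assumptions.

```python
import math

def bernstein_eta(b: float, sigma: float, t: float, n: int) -> float:
    """The choice of eta made at the end of the proof."""
    s = math.sqrt(2.0 * sigma ** 2 * t / n)
    return (3.0 / b) * (1.0 - s / (2.0 * b * t / (3.0 * n) + s))

def bernstein_deviation(b: float, sigma: float, t: float, n: int, eta: float) -> float:
    """eta * sigma^2 / (2 (1 - eta b / 3)) + t / (eta n), valid for 0 < eta < 3 / b."""
    return eta * sigma ** 2 / (2.0 * (1.0 - eta * b / 3.0)) + t / (eta * n)

def bernstein_prefactor_over_k(b: float, sigma: float, t: float, n: int, eta: float) -> float:
    """The factor eta^2 sigma^2 n / (2 (1 - eta b / 3)) multiplying k (e^t - t - 1)^{-1}."""
    return eta ** 2 * sigma ** 2 * n / (2.0 * (1.0 - eta * b / 3.0))

# Arbitrary illustrative values.
b, sigma, t, n = 2.0, 1.5, 3.0, 200
eta = bernstein_eta(b, sigma, t, n)
deviation = bernstein_deviation(b, sigma, t, n, eta)
target = math.sqrt(2.0 * sigma ** 2 * t / n) + b * t / (3.0 * n)
prefactor = bernstein_prefactor_over_k(b, sigma, t, n, eta)
```

For these values the deviation equals $\sqrt{2\bar\sigma^2 t/n} + \bar b t/(3n)$ up to rounding, and the prefactor is at most $t$, which is what the statement of Theorem 4 requires.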



3.3 Discussion 

The advantage of our results here over previous exponential tail inequalities for sums of random
matrices is the absence of explicit dependence on the matrix dimensions. Indeed, all previous tail
inequalities using the exponential moment method (either via the Golden-Thompson inequality
or Lieb's trace inequality) are roughly of the form $d \cdot e^{-t}$ when the matrices in the sum are $d \times d$
(e.g., Recht, 2009; Gross, 2009; Tropp, 2011a). Our
results also improve over the tail inequalities of Rudelson and Vershynin (2007) in that they apply
to full-rank matrices, not just rank-one matrices; and also over those of Magen and Zouzias (2011)
in that they provide an exponential tail inequality, rather than just a polynomial tail. Thus, our
improvements widen the applicability of these inequalities (and the matrix exponential moment
method in general); we explore some of these in Subsection 3.4.

One disadvantage of our technique is that in finite dimensional settings, the relevant trace
quantity that replaces the dimension may turn out to be of the same order as the dimension $d$ (an
example of such a case is discussed next). In such cases, the resulting tail bound from Theorem 4
(say) of $\bar k \cdot t(e^t - t - 1)^{-1}$ is looser than the $d \cdot e^{-t}$ tail bound provided by earlier techniques (e.g.,
Tropp, 2011a).

We note that the matrix exponential moment method used here and in previous work can lead
to a significantly suboptimal tail inequality in some cases. This was pointed out by Tropp (2011a,
Section 4.6), but we elaborate on it here further. Suppose $x_1, \dots, x_n \in \{\pm 1\}^d$ are i.i.d. random
vectors with independent Rademacher entries: each coordinate of $x_i$ is $+1$ or $-1$ with equal
probability. Let $X_i = x_i x_i^\top - I$, so $\mathbb{E}[X_i] = 0$, $\lambda_{\max}(X_i) = \lambda_{\max}(\mathbb{E}[X_i^2]) = d - 1$, and $\operatorname{tr}(\mathbb{E}[X_i^2]) = d(d - 1)$.
In this case, Theorem 4 implies the bound

\[
\Pr\left[\lambda_{\max}\left(\frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top - I\right) > \sqrt{\frac{2(d-1)t}{n}} + \frac{(d-1)t}{3n}\right] \le d\,t(e^t - t - 1)^{-1}.
\]

On the other hand, because the $x_i$ have subgaussian projections, it is known that

\[
\Pr\left[\lambda_{\max}\left(\frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top - I\right) > 2\sqrt{\frac{71d + 16t}{n}} + \frac{10d + 2t}{n}\right] \le 2e^{-t/2}
\]

(Litvak et al., 2005; also see Lemma 2 in Appendix A). First, this latter inequality removes the
$d$ factor on the right-hand side. Perhaps more importantly, the deviation term $t$ does not scale
with $d$ in this inequality, whereas it does in the former. Thus this latter bound provides a much
stronger exponential tail: roughly put, $\Pr[\lambda_{\max}(\sum_{i=1}^{n} x_i x_i^\top/n - I) > c \cdot (\sqrt{d/n} + d/n) + \tau] \le
\exp(-\Omega(n\min(\tau, \tau^2)))$ for some constant $c > 0$; the probability bound from Theorem 4 is only
of the form $\exp(-\Omega((n/d)\min(\tau, \tau^2)))$. The sub-optimality of Theorem 4 is shared by all other
existing tail inequalities proved using this exponential moment method. The issue is related to
the asymptotic freeness of the random matrices $X_1, \dots, X_n$ (Voiculescu, 1991; Guionnet, 2004),
i.e., that nearly all high-order moments of random matrices vanish asymptotically, which is not
exploited in the matrix exponential moment method. This means that the proof technique in
the exponential moment method over-counts the contribution of high-order matrix moments that
should have vanished. Formalizing this discrepancy would help clarify the limits of this technique,
but the task is beyond the scope of this paper. It is also worth mentioning that asymptotic freeness
only holds when the $X_i$ have independent entries. For matrices with correlated entries, our bound
is close to best possible in the worst case.
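The moment identities used for the Rademacher example, $\mathbb{E}[X_i^2] = (d-1)I$ and $\operatorname{tr}(\mathbb{E}[X_i^2]) = d(d-1)$, can be confirmed by exhaustive enumeration for a small dimension. The sketch below does this in plain Python; the function name and the choice $d = 4$ are illustrative assumptions.

```python
from itertools import product

def second_moment_of_centered_outer_product(d: int):
    """Exact E[(x x^T - I)^2] for x uniform on {-1, +1}^d, as a d x d list of lists."""
    total = [[0] * d for _ in range(d)]
    for x in product((-1, 1), repeat=d):
        # X = x x^T - I for this sign pattern.
        X = [[x[r] * x[c] - (1 if r == c else 0) for c in range(d)] for r in range(d)]
        # Accumulate X^2 entry by entry.
        for r in range(d):
            for c in range(d):
                total[r][c] += sum(X[r][k] * X[k][c] for k in range(d))
    count = 2 ** d  # number of equally likely sign patterns
    return [[v / count for v in row] for row in total]

d = 4
second_moment = second_moment_of_centered_outer_product(d)
trace = sum(second_moment[i][i] for i in range(d))
```

For $d = 4$ the result is $3I$ with trace $12 = d(d-1)$, so the ratio $\bar k$ of trace to largest eigenvalue equals $d$ here, the case in which the trace quantity is as large as the dimension.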



3.4 Examples 

For a matrix $M$, let $\|M\|_F$ denote its Frobenius norm, and let $\|M\|_2$ denote its spectral norm.
If $M$ is symmetric, then $\|M\|_2 = \max\{\lambda_{\max}(M), -\lambda_{\min}(M)\}$, where $\lambda_{\max}(M)$ and $\lambda_{\min}(M)$ are,
respectively, the largest and smallest eigenvalues of $M$.



3.4.1 Supremum of a random process 

The first example embeds a random process in a diagonal matrix to show that Theorem 3 is tight
in certain cases.

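Example 1. Let $(Z_1, Z_2, \dots)$ be (possibly dependent) mean-zero subgaussian random variables;
i.e., each $\mathbb{E}[Z_i] = 0$, and there exist positive constants $\sigma_1, \sigma_2, \dots$ such that

\[
\mathbb{E}[\exp(\eta Z_i)] \le \exp\left(\frac{\eta^2\sigma_i^2}{2}\right) \quad \forall \eta \in \mathbb{R}.
\]

We further assume that $v := \sup_i\{\sigma_i^2\} < \infty$ and $k := \frac{1}{v}\sum_i \sigma_i^2 < \infty$. Also, for convenience, we
assume $\log k \ge 1.3$ (to simplify the tail inequality).

Let $X = \operatorname{diag}(Z_1, Z_2, \dots)$ be the random diagonal matrix with the $Z_i$ on its diagonal. We have
$\mathbb{E}[X] = 0$, and

\[
\log\mathbb{E}[\exp(\eta X)] \preceq \operatorname{diag}\left(\frac{\eta^2\sigma_1^2}{2}, \frac{\eta^2\sigma_2^2}{2}, \dots\right),
\]

so

\[
\lambda_{\max}(\log\mathbb{E}[\exp(\eta X)]) \le \frac{\eta^2 v}{2} \quad\text{and}\quad \operatorname{tr}(\log\mathbb{E}[\exp(\eta X)]) \le \frac{\eta^2 v k}{2}.
\]

By Theorem 3, we have

\[
\Pr\left[\lambda_{\max}(X) > \sqrt{2vt}\right] \le kt(e^t - t - 1)^{-1}.
\]

Therefore, letting $t := 2(\tau + \log k) > 2.6$ for $\tau > 0$ and interpreting $\lambda_{\max}(X)$ as $\sup_i\{Z_i\}$,

\[
\Pr\left[\sup_i\{Z_i\} > 2\sqrt{\sup_i\{\sigma_i^2\}\left(\log\frac{\sum_i \sigma_i^2}{\sup_i\{\sigma_i^2\}} + \tau\right)}\right] \le e^{-\tau}.
\]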
Suppose the $Z_i \sim \mathcal{N}(0, 1)$ are just $N$ i.i.d. standard Gaussian random variables. Then the above
inequality states that the largest of the $Z_i$ is $O(\sqrt{\log N + \tau})$ with probability at least $1 - e^{-\tau}$; this
is known to be tight up to constants, so the log term cannot generally be removed. This fact
has been noted by previous works on matrix tail inequalities (e.g., Tropp, 2011a), which also use
this example as an extreme case. We note, however, that these previous works are not applicable
to the case of a countably infinite number of mean-zero Gaussian random variables $Z_i \sim \mathcal{N}(0, \sigma_i^2)$
(or more generally, subgaussian random variables), whereas the above inequality can be applied as
long as the sum of the $\sigma_i^2$ is finite. □
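For $N$ i.i.d. standard Gaussians the tail of the maximum is available in closed form, so the inequality of Example 1 can be compared with the exact probability. The sketch below does this with the standard library; it assumes all $\sigma_i = 1$ (so $v = 1$ and $k = N$), and the function names are illustrative.

```python
import math

def max_gaussian_tail(N: int, x: float) -> float:
    """Exact Pr[max of N i.i.d. N(0, 1) variables > x]."""
    q = 0.5 * math.erfc(x / math.sqrt(2.0))   # Pr[Z > x] for a single variable
    return -math.expm1(N * math.log1p(-q))     # 1 - (1 - q)^N, computed stably

def threshold(N: int, tau: float) -> float:
    """The threshold 2 * sqrt(log N + tau) from Example 1 with v = 1 and k = N."""
    return 2.0 * math.sqrt(math.log(N) + tau)

# Compare the exact tail with the bound e^{-tau}; N >= 4 ensures log k >= 1.3.
comparisons = [
    (N, tau, max_gaussian_tail(N, threshold(N, tau)), math.exp(-tau))
    for N in (4, 100, 10 ** 6)
    for tau in (0.1, 1.0, 5.0)
]
```

Each tuple in `comparisons` holds the exact tail probability next to the bound $e^{-\tau}$, so the slack in the inequality can be inspected directly for the chosen values of $N$ and $\tau$.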



3.4.2 Principal component analysis 

Our next two examples use Theorem 4 to give spectral norm error bounds for estimating the
second moment matrix of a random vector from i.i.d. copies. This is relevant in the context of
(kernel) principal component analysis of high (or infinite) dimensional data (e.g., Schölkopf et al.,
1999).

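Example 2. Let $x_1, \dots, x_n$ be i.i.d. random vectors with $\Sigma := \mathbb{E}[x_i x_i^\top]$, $K := \mathbb{E}[(x_i x_i^\top)^2]$, and
$\|x_i\|_2 \le \ell$ almost surely for some $\ell > 0$. Let $X_i := x_i x_i^\top - \Sigma$ and $\Sigma_n := \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top$. We have
$\lambda_{\max}(X_i) \le \ell^2 - \lambda_{\min}(\Sigma)$. Also, $\lambda_{\max}(n^{-1}\sum_{i=1}^{n} \mathbb{E}[X_i^2]) = \lambda_{\max}(K - \Sigma^2)$ and
$\mathbb{E}[\operatorname{tr}(n^{-1}\sum_{i=1}^{n} \mathbb{E}[X_i^2])] = \operatorname{tr}(K - \Sigma^2)$. By Theorem 4,

\[
\Pr\left[\lambda_{\max}(\Sigma_n - \Sigma) > \sqrt{\frac{2\lambda_{\max}(K - \Sigma^2)t}{n}} + \frac{(\ell^2 - \lambda_{\min}(\Sigma))t}{3n}\right]
\le \frac{\operatorname{tr}(K - \Sigma^2)}{\lambda_{\max}(K - \Sigma^2)} \cdot t(e^t - t - 1)^{-1}.
\]

Since $\lambda_{\max}(-X_i) \le \lambda_{\max}(\Sigma)$, we also have

\[
\Pr\left[\lambda_{\max}(\Sigma - \Sigma_n) > \sqrt{\frac{2\lambda_{\max}(K - \Sigma^2)t}{n}} + \frac{\lambda_{\max}(\Sigma)t}{3n}\right]
\le \frac{\operatorname{tr}(K - \Sigma^2)}{\lambda_{\max}(K - \Sigma^2)} \cdot t(e^t - t - 1)^{-1}.
\]

Therefore

\[
\Pr\left[\|\Sigma_n - \Sigma\|_2 > \sqrt{\frac{2\lambda_{\max}(K - \Sigma^2)t}{n}} + \frac{\max\{\ell^2 - \lambda_{\min}(\Sigma), \lambda_{\max}(\Sigma)\}\,t}{3n}\right]
\le \frac{\operatorname{tr}(K - \Sigma^2)}{\lambda_{\max}(K - \Sigma^2)} \cdot 2t(e^t - t - 1)^{-1}.
\]

A similar result was given by Zwald and Blanchard (2006, Lemma 1) but for Frobenius norm error
rather than spectral norm error. This is generally incomparable to our result, although spectral
norm error may be more appropriate in cases where the spectrum is slow to decay. □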

We now show that combining the bound from the previous example with sharper
dimension-dependent tail inequalities can sometimes lead to stronger results.

Example 3. Let $x_1, \dots, x_n$ be i.i.d. random vectors with $\Sigma := \mathbb{E}[x_i x_i^\top]$; let $X_i := x_i x_i^\top - \Sigma$ and
$\Sigma_n := \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top$. For any positive integer $d \le \operatorname{rank}(\Sigma)$, let $\Pi_{d,0}$ be the orthogonal projector to
the $d$-dimensional eigenspace of $\Sigma$ corresponding to its $d$ largest eigenvalues, and let $\Pi_{d,1} := I - \Pi_{d,0}$.
We have

\begin{align*}
\|\Sigma_n - \Sigma\|_2 &\le \|\Pi_{d,0}(\Sigma_n - \Sigma)\Pi_{d,0}\|_2 + 2\|\Pi_{d,0}(\Sigma_n - \Sigma)\Pi_{d,1}\|_2 + \|\Pi_{d,1}(\Sigma_n - \Sigma)\Pi_{d,1}\|_2 \\
&\le 2\|\Pi_{d,0}(\Sigma_n - \Sigma)\Pi_{d,0}\|_2 + 2\|\Pi_{d,1}(\Sigma_n - \Sigma)\Pi_{d,1}\|_2.
\end{align*}

We can use the tail inequalities from this work to control $\|\Pi_{d,1}(\Sigma_n - \Sigma)\Pi_{d,1}\|_2$, and use potentially
sharper dimension-dependent inequalities to control $\|\Pi_{d,0}(\Sigma_n - \Sigma)\Pi_{d,0}\|_2$.

Let $\Sigma_{d,0} := \Pi_{d,0}\Sigma\Pi_{d,0}$, $\Sigma_{d,1} := \Pi_{d,1}\Sigma\Pi_{d,1}$, $K_{d,1} := \mathbb{E}[(\Pi_{d,1} x_i x_i^\top \Pi_{d,1})^2]$, and assume $\|\Pi_{d,1} x_i\|_2 \le
\ell_{d,1}$ for all $i = 1, \dots, n$ almost surely. Furthermore, suppose there exists $\gamma_{d,0} > 0$ such that for all
$i = 1, \dots, n$ and all vectors $\alpha$,

\[
\mathbb{E}\left[\exp\left(\alpha^\top \Sigma_{d,0}^{-1/2} x_i\right)\right] \le \exp\left(\gamma_{d,0}\|\alpha\|_2^2/2\right)
\]

where $\Sigma_{d,0}^{-1/2}$ is the matrix square-root of the Moore-Penrose pseudoinverse of $\Sigma_{d,0}$. This condition
states that every projection of $\Sigma_{d,0}^{-1/2} x_i$ has subgaussian tails. In this case, the tail behavior of
$\|\Pi_{d,0}(\Sigma_n - \Sigma)\Pi_{d,0}\|_2$ should not depend on the dimensionality $d$. Indeed, a covering number
argument gives

\[
\Pr\left[\|\Pi_{d,0}(\Sigma_n - \Sigma)\Pi_{d,0}\|_2 > 2\gamma_{d,0}\|\Sigma\|_2\left(\sqrt{\frac{71d + 16t}{n}} + \frac{5d + t}{n}\right)\right] \le 2e^{-t/2}
\]

for any $t > 0$ (see Lemma 2 in Appendix A). Combining this with the tail inequality from Example 2,
we have (for $t \ge 2.6$)



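\[
\Pr\Biggl[\|\Sigma_n - \Sigma\|_2 > 4\gamma_{d,0}\|\Sigma\|_2\left(\sqrt{\frac{71d + 16t}{n}} + \frac{5d + t}{n}\right)
+ 2\sqrt{\frac{2\lambda_{\max}(K_{d,1} - \Sigma_{d,1}^2)\left(\log\frac{\operatorname{tr}(K_{d,1} - \Sigma_{d,1}^2)}{\lambda_{\max}(K_{d,1} - \Sigma_{d,1}^2)} + t\right)}{n}}
+ \frac{2\max\{\ell_{d,1}^2 - \lambda_{\min}(\Sigma_{d,1}), \lambda_{\max}(\Sigma_{d,1})\}\left(\log\frac{\operatorname{tr}(K_{d,1} - \Sigma_{d,1}^2)}{\lambda_{\max}(K_{d,1} - \Sigma_{d,1}^2)} + t\right)}{3n}\Biggr]
\le 4e^{-t/2}. \tag{2}
\]
□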



Comparisons. We consider the following stylized scenario to compare the bounds from Example 2
and Example 3.

1. The largest $d$ eigenvalues of $\Sigma$ are all equal to $\|\Sigma\|_2$, and the remaining eigenvalues are smaller
and rapidly decaying so $\operatorname{tr}(\Sigma_{d,1})/\|\Sigma\|_2$ is small.

2. $\ell^2$ and $\ell_{d,1}^2$ are within constant factors of $\operatorname{tr}(\Sigma)$ and $\operatorname{tr}(\Sigma_{d,1})$, respectively; this simply requires
that the squared length of any $x_i$ never be more than a constant factor times its expected
squared length.

3. $\lambda_{\max}(K - \Sigma^2)$ and $\lambda_{\max}(K_{d,1} - \Sigma_{d,1}^2)$ are within constant factors of $\lambda_{\max}(\Sigma)\ell^2$ and $\lambda_{\max}(\Sigma_{d,1})\ell_{d,1}^2$,
respectively; this is similar to the previous condition.

We will also ignore constant and logarithmic factors, as well as the $\gamma_{d,0}$ factors. The bound on
$\|\Sigma_n\|_2$ from Example 3 then becomes (roughly)



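\[
\|\Sigma\|_2\left(1 + \sqrt{\frac{d + t}{n}} + \frac{d + t}{n}\right)
+ \|\Sigma\|_2\left(\sqrt{\frac{(\operatorname{tr}(\Sigma_{d,1})/\|\Sigma\|_2)\,t}{n}} + \frac{(\operatorname{tr}(\Sigma_{d,1})/\|\Sigma\|_2)\,t}{n}\right) \tag{3}
\]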
whereas the bound from Example 2 is

\[
\|\Sigma\|_2 + \|\Sigma\|_2\left(\sqrt{\frac{(d + \operatorname{tr}(\Sigma_{d,1})/\|\Sigma\|_2)\,t}{n}} + \frac{(d + \operatorname{tr}(\Sigma_{d,1})/\|\Sigma\|_2)\,t}{n}\right). \tag{4}
\]

The main difference between these bounds is that the deviation term $t$ does not scale with $d$ in (3),
but it does in (4), so the exponential tail in the latter is much weaker, as discussed in Subsection 3.3.
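We can also compare the bound from Example 3 to the case where the $x_i$ are i.i.d. Gaussian
random vectors with mean zero and covariance $\Sigma$. Arrange the $x_i$ as columns in a matrix
$A_n = [x_1|\cdots|x_n]$, so

\[
\|\Sigma_n\|_2 = \left\|\frac{1}{n}A_nA_n^\top\right\|_2 = \frac{1}{n}\|A_n\|_2^2.
\]

Note that $A_n$ has the same distribution as $\Sigma^{1/2}Z$, where $Z$ is a matrix of independent standard
Gaussian random variables. The function $Z \mapsto \|\Sigma^{1/2}Z\|_2 = \|A_n\|_2$ is $\|\Sigma\|_2^{1/2}$-Lipschitz in $Z$, so
by Gaussian concentration (Pisier, 1989),

\[
\Pr\left[\|A_n\|_2 > \mathbb{E}[\|A_n\|_2] + \sqrt{2\|\Sigma\|_2 t}\right] \le e^{-t}.
\]

The expectation can be bounded using a result of Gordon (1985, 1988):

\[
\mathbb{E}[\|A_n\|_2] = \mathbb{E}[\|\Sigma^{1/2}Z\|_2] \le \|\Sigma^{1/2}\|_2\sqrt{n} + \|\Sigma^{1/2}\|_F.
\]

Putting these together, we obtain

\[
\Pr\left[\|\Sigma_n\|_2 > \|\Sigma\|_2 + 2\sqrt{\frac{\|\Sigma\|_2\operatorname{tr}(\Sigma)}{n}} + 2\sqrt{\frac{2\|\Sigma\|_2^2 t}{n}} + \frac{\operatorname{tr}(\Sigma) + 2\sqrt{2\|\Sigma\|_2\operatorname{tr}(\Sigma)t} + 2\|\Sigma\|_2 t}{n}\right] \le e^{-t}.
\]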
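In our stylized scenario, this roughly implies a bound on $\|\Sigma_n\|_2$ of the form

\[
\|\Sigma\|_2\left(1 + \sqrt{\frac{d + \operatorname{tr}(\Sigma_{d,1})/\|\Sigma\|_2}{n}} + \frac{d + \operatorname{tr}(\Sigma_{d,1})/\|\Sigma\|_2}{n}\right)
+ \|\Sigma\|_2\left(\sqrt{\frac{t}{n}} + \frac{t}{n}\right). \tag{5}
\]

Compared to (3), we see that the main difference is that $t$ does not scale with $\operatorname{tr}(\Sigma_{d,1})/\|\Sigma\|_2$ in (5),
but it does in (3). Therefore the bounds are comparable (up to constant and logarithmic factors)
when the eigenspectrum of $\Sigma$ is rapidly decaying after the first $d$ eigenvalues.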



3.4.3 Approximate matrix multiplication 

Finally, we give an example about approximating a matrix product $AB^\top$ using non-uniform
sampling of the columns of $A$ and $B$.

Example 4. Let $A := [a_1|\cdots|a_m]$ and $B := [b_1|\cdots|b_m]$ be fixed matrices, each with $m$ columns.
Assume $a_i \ne 0$ and $b_i \ne 0$ for all $i = 1, \dots, m$. If $m$ is very large, then the straightforward
computation of the product $AB^\top$ can be prohibitive. An alternative is to take a small
(non-uniform) random sample of the columns of $A$ and $B$, say $a_{i_1}, b_{i_1}, \dots, a_{i_n}, b_{i_n}$, and then compute a
weighted sum of outer products

\[
\frac{1}{n}\sum_{j=1}^{n} \frac{a_{i_j}b_{i_j}^\top}{p_{i_j}}
\]

where $p_{i_j} > 0$ is the a priori probability of choosing the column index $i_j \in \{1, \dots, m\}$ (the actual
values of the probabilities $p_i$ for $i = 1, \dots, m$ are given below). An analysis of this scheme was given
by Magen and Zouzias (2011) with the stronger requirement that the number of columns sampled
be polynomially related to the allowed failure probability. Here we give an analysis in which the
number of columns sampled depends only logarithmically on the failure probability.
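Let $X_1, \dots, X_n$ be i.i.d. random matrices with the discrete distribution given by

\[
\Pr\left[X_j = \frac{1}{p_i}\begin{bmatrix} 0 & a_ib_i^\top \\ b_ia_i^\top & 0 \end{bmatrix}\right] = p_i \propto \|a_i\|_2\|b_i\|_2
\]

for all $i = 1, \dots, m$, where $p_i := \|a_i\|_2\|b_i\|_2/Z$ and $Z := \sum_{i=1}^{m} \|a_i\|_2\|b_i\|_2$. Let

\[
M_n := \frac{1}{n}\sum_{j=1}^{n} X_j \quad\text{and}\quad M := \begin{bmatrix} 0 & AB^\top \\ BA^\top & 0 \end{bmatrix}.
\]

Note that $\|M_n - M\|_2$ is the spectral norm error of approximating $AB^\top$ using the average of $n$ outer
products $\frac{1}{n}\sum_{j=1}^{n} a_{i_j}b_{i_j}^\top/p_{i_j}$, where the indices are such that $i_j = i$ exactly when $X_j$ is the matrix
built from $a_ib_i^\top/p_i$, for $j = 1, \dots, n$. We have the following identities:

\begin{align*}
\mathbb{E}[X_j] &= \sum_{i=1}^{m} p_i \cdot \frac{1}{p_i}\begin{bmatrix} 0 & a_ib_i^\top \\ b_ia_i^\top & 0 \end{bmatrix}
= \begin{bmatrix} 0 & \sum_{i=1}^{m} a_ib_i^\top \\ \sum_{i=1}^{m} b_ia_i^\top & 0 \end{bmatrix} = M \\
\operatorname{tr}(\mathbb{E}[X_j^2]) &= \operatorname{tr}\left(\sum_{i=1}^{m} p_i \cdot \frac{1}{p_i^2}\begin{bmatrix} a_ib_i^\top b_ia_i^\top & 0 \\ 0 & b_ia_i^\top a_ib_i^\top \end{bmatrix}\right)
= \sum_{i=1}^{m} \frac{2\|a_i\|_2^2\|b_i\|_2^2}{p_i} = 2Z^2 \\
\operatorname{tr}(\mathbb{E}[X_j]^2) &= \operatorname{tr}\left(\begin{bmatrix} AB^\top BA^\top & 0 \\ 0 & BA^\top AB^\top \end{bmatrix}\right) = 2\operatorname{tr}(A^\top AB^\top B);
\end{align*}

and the following inequalities:

\begin{align*}
\|X_j\|_2 &\le \max_{i=1,\dots,m} \frac{1}{p_i}\left\|\begin{bmatrix} 0 & a_ib_i^\top \\ b_ia_i^\top & 0 \end{bmatrix}\right\|_2
= \max_{i=1,\dots,m} \frac{\|a_ib_i^\top\|_2}{p_i} = Z \\
\|\mathbb{E}[X_j]\|_2 &= \|AB^\top\|_2 \le \|A\|_2\|B\|_2 \\
\|\mathbb{E}[X_j^2]\|_2 &\le \|A\|_2\|B\|_2 Z.
\end{align*}

This means $\|X_j - M\|_2 \le Z + \|A\|_2\|B\|_2$ and $\|\mathbb{E}[(X_j - M)^2]\|_2 \le \|\mathbb{E}[X_j^2] - M^2\|_2 \le \|A\|_2\|B\|_2(Z +
\|A\|_2\|B\|_2)$, so Theorem 4 and a union bound imply

\[
\Pr\left[\|M_n - M\|_2 > \sqrt{\frac{2(\|A\|_2\|B\|_2(Z + \|A\|_2\|B\|_2))t}{n}} + \frac{(Z + \|A\|_2\|B\|_2)t}{3n}\right]
\le 4\left(\frac{Z^2 - \operatorname{tr}(A^\top AB^\top B)}{\|A\|_2\|B\|_2(Z + \|A\|_2\|B\|_2)}\right)t(e^t - t - 1)^{-1}.
\]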

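Let $r_A := \|A\|_F^2/\|A\|_2^2 \in [1, \operatorname{rank}(A)]$ and $r_B := \|B\|_F^2/\|B\|_2^2 \in [1, \operatorname{rank}(B)]$ be the numerical (or
stable) rank of $A$ and $B$, respectively. Since $Z/(\|A\|_2\|B\|_2) \le \|A\|_F\|B\|_F/(\|A\|_2\|B\|_2) = \sqrt{r_Ar_B}$,
we have the simplified (but slightly looser) bound

\[
\Pr\left[\frac{\|M_n - M\|_2}{\|A\|_2\|B\|_2} > 2\sqrt{\frac{(1 + \sqrt{r_Ar_B})(\log(4\sqrt{r_Ar_B}) + t)}{n}} + \frac{2(1 + \sqrt{r_Ar_B})(\log(4\sqrt{r_Ar_B}) + t)}{3n}\right] \le e^{-t}.
\]

Therefore, for any $\epsilon \in (0, 1)$ and $\delta \in (0, 1)$, if

\[
n \ge \left(\frac{8}{3} + 2\sqrt{\frac{5}{3}}\right)\frac{(1 + \sqrt{r_Ar_B})(\log(4\sqrt{r_Ar_B}) + \log(1/\delta))}{\epsilon^2}
\]

then with probability at least $1 - \delta$ over the random choice of column indices $i_1, \dots, i_n$,

\[
\left\|\frac{1}{n}\sum_{j=1}^{n} \frac{a_{i_j}b_{i_j}^\top}{p_{i_j}} - AB^\top\right\|_2 \le \epsilon\|A\|_2\|B\|_2. \qquad □
\]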
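The sampling scheme of Example 4 can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: matrices are stored as lists of rows, the helper names are hypothetical, and the estimator returned is the average of outer products rather than the symmetric block matrix $M_n$ used in the analysis.

```python
import math
import random

def column(M, i):
    """Column i of a matrix stored as a list of rows."""
    return [row[i] for row in M]

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def sampling_probabilities(A, B):
    """Probabilities p_i proportional to ||a_i||_2 ||b_i||_2, and the normalizer Z."""
    m = len(A[0])
    weights = [norm(column(A, i)) * norm(column(B, i)) for i in range(m)]
    Z = sum(weights)
    return [w / Z for w in weights], Z

def approx_product(A, B, n, rng):
    """Average of n sampled outer products a_i b_i^T / p_i, an estimate of A B^T."""
    p, _ = sampling_probabilities(A, B)
    est = [[0.0] * len(B) for _ in range(len(A))]
    for i in rng.choices(range(len(p)), weights=p, k=n):
        a, b = column(A, i), column(B, i)
        for r in range(len(A)):
            for c in range(len(B)):
                est[r][c] += a[r] * b[c] / (p[i] * n)
    return est

def exact_product(A, B):
    """The exact product A B^T, for comparison."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in B] for ra in A]

# Small illustrative matrices with m = 3 non-zero columns each.
A = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]]
B = [[2.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
p, Z = sampling_probabilities(A, B)
estimate = approx_product(A, B, 50, random.Random(0))
```

With these probabilities every sampled outer product $a_ib_i^\top/p_i$ has spectral norm exactly $Z$, which is the uniform bound on $\|X_j\|_2$ used when applying Theorem 4.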
A Sums of random vector outer products

The following lemma is a tail inequality for smallest and largest eigenvalues of the empirical
covariance matrix of subgaussian random vectors. This result (with non-explicit constants) was originally
obtained by Litvak et al. (2005).
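Lemma 2. Let $x_1, \dots, x_n$ be random vectors in $\mathbb{R}^d$ such that, for some $\gamma > 0$,

\[
\mathbb{E}\left[x_ix_i^\top \mid x_1, \dots, x_{i-1}\right] = I \quad\text{and}\quad
\mathbb{E}\left[\exp(\alpha^\top x_i) \mid x_1, \dots, x_{i-1}\right] \le \exp\left(\|\alpha\|_2^2\gamma/2\right) \text{ for all } \alpha \in \mathbb{R}^d
\]

for all $i = 1, \dots, n$, almost surely. For all $\epsilon_0 \in (0, 1/2)$ and $\delta \in (0, 1)$,

\[
\Pr\left[\lambda_{\max}\left(\frac{1}{n}\sum_{i=1}^{n} x_ix_i^\top\right) > 1 + \frac{1}{1 - 2\epsilon_0} \cdot \epsilon_{\epsilon_0,\delta,n}
\ \text{ or }\ \lambda_{\min}\left(\frac{1}{n}\sum_{i=1}^{n} x_ix_i^\top\right) < 1 - \frac{1}{1 - 2\epsilon_0} \cdot \epsilon_{\epsilon_0,\delta,n}\right] \le \delta
\]

where

\[
\epsilon_{\epsilon_0,\delta,n} := \gamma \cdot \left(\sqrt{\frac{32(d\log(1 + 2/\epsilon_0) + \log(2/\delta))}{n}} + \frac{2(d\log(1 + 2/\epsilon_0) + \log(2/\delta))}{n}\right).
\]

Remark 1. In our applications of this lemma, we will simply choose $\epsilon_0 := 1/4$ for concreteness. □

We give the proof of Lemma 2 for completeness.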

The subgaussian property most readily lends itself to bounds on linear combinations of
subgaussian random variables. However, we are interested in bounding certain quadratic combinations.
Therefore we bootstrap from the bound for linear combinations to bound the moment generating 
function of the quadratic combinations; from there, we can obtain the desired tail inequality. 

The following lemma relates the moment generating function to a tail inequality. 

Lemma 3. Let $W$ be a non-negative random variable. For any $\eta \in \mathbb{R}$,

\[
\mathbb{E}[\exp(\eta W)] - \eta\mathbb{E}[W] - 1 = \eta\int_0^\infty (\exp(\eta t) - 1) \cdot \Pr[W > t] \cdot dt.
\]

Proof. Integration-by-parts. □
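The identity in Lemma 3 can be sanity-checked on a distribution for which both sides are known in closed form. The sketch below uses a unit-rate exponential random variable, for which $\Pr[W > t] = e^{-t}$, $\mathbb{E}[W] = 1$, and $\mathbb{E}[\exp(\eta W)] = 1/(1 - \eta)$ when $\eta < 1$; the choice of distribution, the truncation point, and the step count are illustrative assumptions.

```python
import math

def lemma3_rhs(eta: float, upper: float = 60.0, steps: int = 200_000) -> float:
    """Midpoint-rule approximation of eta * int_0^upper (e^{eta t} - 1) Pr[W > t] dt for W ~ Exp(1)."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        total += math.expm1(eta * t) * math.exp(-t)  # (e^{eta t} - 1) * Pr[W > t]
    return eta * total * h

def lemma3_lhs(eta: float) -> float:
    """E[exp(eta W)] - eta E[W] - 1 for W ~ Exp(1), valid for eta < 1."""
    return 1.0 / (1.0 - eta) - eta - 1.0

# One positive and one negative value of eta.
pairs = [(lemma3_lhs(eta), lemma3_rhs(eta)) for eta in (0.5, -1.0)]
```

Both sides equal $\eta^2/(1 - \eta)$ in this case, so the two entries of each pair should agree up to the discretization error of the midpoint rule.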

The next lemma gives a tail inequality for any particular Rayleigh quotient of the empirical 
covariance matrix. 

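Lemma 4. Let $x_1, \dots, x_n$ be random vectors in $\mathbb{R}^d$ such that, for some $\gamma > 0$,

\[
\mathbb{E}\left[x_ix_i^\top \mid x_1, \dots, x_{i-1}\right] = I \quad\text{and}\quad
\mathbb{E}\left[\exp(\alpha^\top x_i) \mid x_1, \dots, x_{i-1}\right] \le \exp\left(\|\alpha\|_2^2\gamma/2\right) \text{ for all } \alpha \in \mathbb{R}^d
\]

for all $i = 1, \dots, n$, almost surely. For all $\alpha \in \mathbb{R}^d$ such that $\|\alpha\|_2 = 1$, and all $\delta \in (0, 1)$,

\[
\Pr\left[\alpha^\top\left(\frac{1}{n}\sum_{i=1}^{n} x_ix_i^\top\right)\alpha > 1 + \sqrt{\frac{32\gamma^2\log(1/\delta)}{n}} + \frac{2\gamma\log(1/\delta)}{n}\right] \le \delta
\]

and

\[
\Pr\left[\alpha^\top\left(\frac{1}{n}\sum_{i=1}^{n} x_ix_i^\top\right)\alpha < 1 - \sqrt{\frac{32\gamma^2\log(1/\delta)}{n}}\right] \le \delta.
\]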
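Proof. Fix $\alpha \in \mathbb{R}^d$ with $\|\alpha\|_2 = 1$. For $i = 1, \dots, n$, let $W_i := (\alpha^\top x_i)^2$, so $\mathbb{E}[W_i] = 1$. For any $t > 0$,
using Chernoff's bounding method gives

\begin{align*}
\mathbb{E}\left[\mathbb{1}[W_i > t] \mid x_1, \dots, x_{i-1}\right]
&\le \inf_{\eta > 0}\left\{\mathbb{E}\left[\mathbb{1}\left[\exp(\eta|\alpha^\top x_i|) > e^{\eta\sqrt{t}}\right] \mid x_1, \dots, x_{i-1}\right]\right\} \\
&\le \inf_{\eta > 0}\left\{e^{-\eta\sqrt{t}} \cdot \mathbb{E}\left[\exp(\eta|\alpha^\top x_i|) \mid x_1, \dots, x_{i-1}\right]\right\} \\
&\le \inf_{\eta > 0}\left\{e^{-\eta\sqrt{t}} \cdot \left(\mathbb{E}\left[\exp(\eta\alpha^\top x_i) \mid x_1, \dots, x_{i-1}\right] + \mathbb{E}\left[\exp(-\eta\alpha^\top x_i) \mid x_1, \dots, x_{i-1}\right]\right)\right\} \\
&\le \inf_{\eta > 0}\left\{2\exp\left(-\eta\sqrt{t} + \eta^2\gamma/2\right)\right\} \\
&= 2\exp\left(-\frac{t}{2\gamma}\right).
\end{align*}

So by Lemma 3, for any $\eta < 1/(2\gamma)$,

\begin{align*}
\mathbb{E}\left[\exp(\eta W_i) \mid x_1, \dots, x_{i-1}\right]
&\le 1 + \eta + \eta\int_0^\infty (\exp(\eta t) - 1) \cdot 2\exp\left(-\frac{t}{2\gamma}\right) dt \\
&= 1 + \eta + \frac{8\eta^2\gamma^2}{1 - 2\eta\gamma} \\
&\le \exp\left(\eta + \frac{8\eta^2\gamma^2}{1 - 2\eta\gamma}\right)
\end{align*}

and therefore

\[
\mathbb{E}\left[\exp\left(\eta\sum_{i=1}^{n} W_i\right)\right] \le \exp\left(n\eta + \frac{8n\eta^2\gamma^2}{1 - 2\eta\gamma}\right).
\]

Using Chernoff's bounding method twice more gives

\[
\Pr\left[\sum_{i=1}^{n} W_i > n + t\right]
\le \inf_{0 \le \eta < 1/(2\gamma)}\left\{\exp\left(-t\eta + \frac{8n\eta^2\gamma^2}{1 - 2\eta\gamma}\right)\right\}
= \exp\left(-\frac{8n\gamma^2 + \gamma t - \sqrt{8n\gamma^2(8n\gamma^2 + 2\gamma t)}}{2\gamma^2}\right)
\]

and

\[
\Pr\left[\sum_{i=1}^{n} W_i < n - t\right]
\le \inf_{\eta \le 0}\left\{\exp\left(t\eta + \frac{8n\eta^2\gamma^2}{1 - 2\eta\gamma}\right)\right\}
\le \exp\left(-\frac{t^2}{32n\gamma^2}\right).
\]

The claim follows. □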



In order to bound the smallest and largest eigenvalues of the empirical covariance matrix, we
apply the bound for the Rayleigh quotient in Lemma 4 together with a covering argument.

Lemma 5 (Pisier, 1989). For any $\epsilon_0 > 0$, there exists $Q \subseteq S^{d-1} := \{\alpha \in \mathbb{R}^d : \|\alpha\|_2 = 1\}$ of
cardinality $\le (1 + 2/\epsilon_0)^d$ such that $\forall \alpha \in S^{d-1}\ \exists q \in Q\,.\ \|\alpha - q\|_2 \le \epsilon_0$.
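Proof of Lemma 2. Let $\hat\Sigma := (1/n)\sum_{i=1}^{n} x_ix_i^\top$, let $S^{d-1} := \{\alpha \in \mathbb{R}^d : \|\alpha\|_2 = 1\}$ be the unit sphere
in $\mathbb{R}^d$, and let $Q \subset S^{d-1}$ be an $\epsilon_0$-cover of $S^{d-1}$ of minimum size with respect to $\|\cdot\|_2$. By Lemma 5,
the cardinality of $Q$ is at most $(1 + 2/\epsilon_0)^d$. Let $\mathcal{E}$ be the event

\[
\max\left\{|q^\top(\hat\Sigma - I)q| : q \in Q\right\} \le \epsilon_{\epsilon_0,\delta,n}.
\]

By Lemma 4 and a union bound, $\Pr[\mathcal{E}] \ge 1 - \delta$. Now assume the event $\mathcal{E}$ holds. Let $\alpha_0 \in S^{d-1}$
be such that $|\alpha_0^\top(\hat\Sigma - I)\alpha_0| = \max\{|\alpha^\top(\hat\Sigma - I)\alpha| : \alpha \in S^{d-1}\} = \|\hat\Sigma - I\|_2$. Using the triangle and
Cauchy-Schwarz inequalities, we have

\begin{align*}
\|\hat\Sigma - I\|_2 = |\alpha_0^\top(\hat\Sigma - I)\alpha_0|
&= \min_{q \in Q}\left|q^\top(\hat\Sigma - I)q + \alpha_0^\top(\hat\Sigma - I)\alpha_0 - q^\top(\hat\Sigma - I)q\right| \\
&\le \min_{q \in Q}\ |q^\top(\hat\Sigma - I)q| + \left|\alpha_0^\top(\hat\Sigma - I)\alpha_0 - q^\top(\hat\Sigma - I)q\right| \\
&= \min_{q \in Q}\ |q^\top(\hat\Sigma - I)q| + \left|\alpha_0^\top(\hat\Sigma - I)(\alpha_0 - q) - (q - \alpha_0)^\top(\hat\Sigma - I)q\right| \\
&\le \min_{q \in Q}\ |q^\top(\hat\Sigma - I)q| + \|\alpha_0\|_2\|\hat\Sigma - I\|_2\|\alpha_0 - q\|_2 + \|q - \alpha_0\|_2\|\hat\Sigma - I\|_2\|q\|_2 \\
&\le \epsilon_{\epsilon_0,\delta,n} + 2\epsilon_0\|\hat\Sigma - I\|_2
\end{align*}

so

\[
\max\left\{\lambda_{\max}(\hat\Sigma) - 1,\ 1 - \lambda_{\min}(\hat\Sigma)\right\} = \|\hat\Sigma - I\|_2 \le \frac{1}{1 - 2\epsilon_0} \cdot \epsilon_{\epsilon_0,\delta,n}. \qquad □
\]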